Resource allocation
resource.json
Set up computing cluster resources for training, molecular dynamics (MD) exploration, and DFT calculations (SCF, relaxation, AIMD). This includes specifying the computing nodes, CPU/GPU resources, and the corresponding software (LAMMPS, VASP, PWMAT, PWMLFF).
It is divided into three module parameters: train, explore, and DFT. For initial training set preparation (init_bulk), only DFT needs to be set, as shown in the following JSON dict.
{
    "train": {
        "command": "",
        "group_size": 1,
        "_parallel_num": 1,
        "number_node": 1,
        "gpu_per_node": 1,
        "cpu_per_node": 1,
        "queue_name": "new3080ti,3080ti,3090",
        "custom_flags": [],
        "source_list": [],
        "module_list": []
    },
    "explore": {
        "command": "mpirun -np 8 lmp_mpi -in in.lammps",
        "group_size": 1,
        "number_node": 1,
        "gpu_per_node": 1,
        "cpu_per_node": 1,
        "queue_name": "new3080ti,3080ti,3090",
        "custom_flags": [],
        "source_list": [],
        "module_list": []
    },
    "DFT": {
        "command": "mpirun -np 4 PWmat",
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 4,
        "group_size": 1,
        "queue_name": "3080ti,new3080ti,3090",
        "custom_flags": [],
        "source_list": [],
        "module_list": []
    }
}
Parameter details
command
A mandatory parameter that sets the execution command for the module. Examples for different task configurations:
For DFT calculations:
PWmat settings:
"command": "mpirun -np 4 PWmat"
DFTB (PWmat) settings:
"command": "PWmat"
VASP settings:
"command": "vasp_std"
cp2k settings:
"command": "mpirun -np $SLURM_NTASKS cp2k.popt"
For LAMMPS calculations (GPU version and CPU version, respectively):
"command": "mpirun -np 1 lmp_mpi_gpu"
"command": "mpirun -np 10 lmp_mpi"
The number after -np represents the number of GPUs or CPUs used, which should be consistent with the gpu_per_node or cpu_per_node settings.
For model training with PWMLFF (see the PWMLFF documentation):
"command": "PWMLFF"
group_size
This parameter is used for grouping multiple computing tasks: tasks within the same group are executed sequentially, while different groups run in parallel. The default value is 1, which means no grouping.
For example, if there are 34 self-consistent (SCF) calculation tasks and "group_size": 5 is set, the 34 tasks will be divided into ceil(34/5) = 7 groups, resulting in 7 Slurm jobs: six of them contain 5 SCF calculations each, and the last contains 4. During execution, the 7 Slurm jobs are submitted to the computing cluster simultaneously, while the SCF calculations within each job are executed sequentially.
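In other words, the number of Slurm jobs is the ceiling of the task count divided by group_size. A quick standalone check of the arithmetic above (an illustration only, not part of any generated script):
tasks=34; group_size=5
echo $(( (tasks + group_size - 1) / group_size ))   # integer ceiling -> 7 Slurm jobs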
In the train module, this parameter is automatically set to 1.
number_node
Used to set the number of computing nodes for each Slurm job. The default value is 1, indicating 1 computing node.
In the train module, this parameter is automatically set to 1.
gpu_per_node
Used to set the number of GPUs used per node. The default value is 0. If PWMAT is used for DFT calculations (self-consistent calculations, relaxation, or AIMD), this value needs to match the number of GPUs set in "command".
In the train module, this parameter is automatically set to 1.
cpu_per_node
Used to set the number of CPUs used per node. The default value is 1. Note that this value must be >= gpu_per_node.
In the train module, this parameter is automatically set to 1.
queue_name
A required parameter that sets the compute cluster partition(s) to use, given as a comma-separated string. For example, "queue_name": "cpu,3080ti,new3080ti,3090".
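The partitions listed here become the --partition directive of the generated Slurm script; for instance, "queue_name": "3080ti,new3080ti,3090" produces the line:
#SBATCH --partition=3080ti,new3080ti,3090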
custom_flags
Used to set additional #SBATCH parameters in the Slurm script. It is an optional parameter in list format. For example, with
"custom_flags": [
    "#SBATCH -x gn43,gn66,login"
]
During execution, the string "#SBATCH -x gn43,gn66,login" will be automatically appended to the Slurm script.
source_list
Used to set the scripts that need to be sourced (environment setup) during the execution of the Slurm script. It is an optional parameter in list format. For example, with
"source_list": [
    "source /opt/rh/devtoolset-8/enable"
]
During execution, the string "source /opt/rh/devtoolset-8/enable" will be automatically written into the Slurm script.
module_list
Used to set the software modules that need to be loaded during the execution of the Slurm script. It is an optional parameter in list format. For example, with
"module_list": [
    "cuda/11.6",
    "intel/2020"
]
During execution, the strings "module load cuda/11.6" and "module load intel/2020" will be automatically written into the Slurm script.
env_list
Used to set the environment information (e.g., export statements) that the Slurm script needs to load at runtime. It is an optional parameter in list format. For example, with
"env_list": [
    "export PATH=~/codespace/PWMLFF_feat/src/bin:$PATH",
    "export PYTHONPATH=~/codespace/PWMLFF_feat/src/:$PYTHONPATH"
]
During execution, the two export statements above will be written verbatim into the Slurm script.
With the queue_name, custom_flags, source_list, module_list, and env_list settings above, the generated Slurm script contains the following lines:
#SBATCH --partition=3080ti,new3080ti,3090
#SBATCH -x gn43,gn66,login
source /opt/rh/devtoolset-8/enable
module load cuda/11.6
module load intel/2020
export PATH=~/codespace/PWMLFF_feat/src/bin:$PATH
export PYTHONPATH=~/codespace/PWMLFF_feat/src/:$PYTHONPATH
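Putting everything together with a command (here the explore command "mpirun -np 8 lmp_mpi -in in.lammps" from the opening example), the complete job script would presumably have roughly the following shape; the shebang and the directives derived from number_node, gpu_per_node, and cpu_per_node are illustrative, since their exact form is generated by the tool:
#!/bin/bash
#SBATCH --partition=3080ti,new3080ti,3090
#SBATCH -x gn43,gn66,login
# (directives derived from number_node / gpu_per_node / cpu_per_node also appear here)
source /opt/rh/devtoolset-8/enable
module load cuda/11.6
module load intel/2020
export PATH=~/codespace/PWMLFF_feat/src/bin:$PATH
export PYTHONPATH=~/codespace/PWMLFF_feat/src/:$PYTHONPATH
mpirun -np 8 lmp_mpi -in in.lammps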
Configuration Examples in Detail
Train Module
For the train module, the Python runtime environment of PWMLFF must be loaded. If you are using the PWMLFF installed on MCLOUD for training, the corresponding settings are as follows:
"train": {
    "command": "PWMLFF",
    "group_size": 1,
    "number_node": 1,
    "gpu_per_node": 1,
    "cpu_per_node": 1,
    "queue_name": "new3080ti,3080ti,3090",
    "custom_flags": [],
    "source_list": [
        "/share/app/PWMLFF/PWMLFF2024.5/env.sh"
    ],
    "env_list": [],
    "module_list": []
}
Here, 1 compute node is used with 1 GPU and 1 CPU. The node is located in the new3080ti, 3080ti, or 3090 partition.
If you compile and install PWMLFF from source code, then, taking my computing cluster's environment configuration as an example, the corresponding settings are as follows:
"train": {
    "command": "PWMLFF",
    "group_size": 1,
    "number_node": 1,
    "gpu_per_node": 1,
    "cpu_per_node": 1,
    "queue_name": "new3080ti,3080ti,3090",
    "custom_flags": [],
    "source_list": [
        "~/anaconda3/etc/profile.d/conda.sh"
    ],
    "env_list": [
        "conda activate torch2_feat",
        "export PATH=~/codespace/PWMLFF_feat/src/bin:$PATH",
        "export PYTHONPATH=~/codespace/PWMLFF_feat/src/:$PYTHONPATH"
    ],
    "module_list": [
        "cuda/11.6",
        "intel/2020"
    ]
}
Here, "~/anaconda3/etc/profile.d/conda.sh" is the conda initialization script on my cluster, torch2_feat is the conda environment of PWMLFF, and ~/codespace/PWMLFF_feat is the path where the PWMLFF source code is located.
Explore Module
For the explore module, taking the LAMMPS installed on MCLOUD as an example, you only need to load the lammps4pwmlff module. The complete settings are as follows:
"explore": {
    "command": "mpirun -np 1 lmp_mpi_gpu",
    "group_size": 2,
    "number_node": 1,
    "gpu_per_node": 1,
    "cpu_per_node": 1,
    "queue_name": "new3080ti,3080ti,3090",
    "custom_flags": [],
    "source_list": [],
    "module_list": [
        "lammps4pwmlff"
    ],
    "env_list": []
}
Here, 1 compute node is used with 1 GPU and 1 CPU. Every 2 LAMMPS tasks are grouped into 1 group.
If you compile and install LAMMPS from source code, then, taking my computing cluster's environment configuration as an example, the corresponding settings are as follows:
"explore": {
    "command": "mpirun -np 1 lmp_mpi_gpu",
    "group_size": 2,
    "number_node": 1,
    "gpu_per_node": 1,
    "cpu_per_node": 1,
    "queue_name": "new3080ti,3080ti,3090",
    "custom_flags": [],
    "source_list": [
        "~/anaconda3/etc/profile.d/conda.sh"
    ],
    "module_list": [
        "cuda/11.6",
        "intel/2020"
    ],
    "env_list": [
        "conda activate torch2_feat",
        "export PATH=~/codespace/PWMLFF_feat/src/bin:$PATH",
        "export PYTHONPATH=~/codespace/PWMLFF_feat/src/:$PYTHONPATH",
        "export PATH=~/codespace/lammps_torch/src:$PATH",
        "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(python3 -c \"import torch; print(torch.__path__[0])\")/lib:$(dirname $(dirname $(which python3)))/lib:$(dirname $(dirname $(which PWMLFF)))/op/build/lib"
    ]
}
Here, "~/anaconda3/etc/profile.d/conda.sh" is the conda initialization script on my cluster, torch2_feat is the conda environment of PWMLFF, ~/codespace/PWMLFF_feat is the path where the PWMLFF source code is located, and ~/codespace/lammps_torch is the path where the LAMMPS source code is located.
DFT Module
For the DFT module, let's take loading PWMAT as an example, with the following settings:
"DFT": {
    "command": "PWmat",
    "number_node": 1,
    "cpu_per_node": 4,
    "gpu_per_node": 4,
    "group_size": 5,
    "queue_name": "3080ti,new3080ti,1080ti,3090",
    "custom_flags": [
        "#SBATCH -x gn18,gn17"
    ],
    "module_list": [
        "compiler/2022.0.2",
        "mkl/2022.0.2",
        "mpi/2021.5.1"
    ],
    "env_list": [
        "module load cuda/11.6"
    ]
}
Here, 1 compute node is used with 4 GPUs and 4 CPUs. Every 5 DFT tasks are grouped into 1 group.
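Since a stray trailing comma or a missing bracket is the most common mistake when editing this file, it can be worth validating resource.json before launching a run (a generic JSON check, not a feature of the workflow tool):
python3 -m json.tool resource.json > /dev/null && echo "resource.json is valid JSON"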